NSF PAR Search | NSF Public Access Repository

Distilling Vision-Language Models on Millions of Videos

https://doi.org/10.1109/CVPR52733.2024.01245

Zhao, Yue; Zhao, Long; Zhou, Xingyi; Wu, Jialin; Chu, Chun-Te; Miao, Hui; Schroff, Florian; Adam, Hartwig; Liu, Ting; Gong, Boqing; et al. (June 2024, CVPR)

Full Text Available
Detecting twenty-thousand classes using image-level supervision

Zhou, Xingyi; Girdhar, Rohit; Joulin, Armand; Krähenbühl, Philipp; Misra, Ishan (October 2022, European Conference on Computer Vision)

Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not need complex assignment schemes to assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic obtains 41.7 mAP when evaluated on all classes, or only rare classes, hence closing the gap in performance for object categories with few samples. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning.
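The idea of training a detector's classifier from image-level labels, without assigning labels to boxes based on model predictions, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the rule of applying the image labels to the largest proposal, the function names, and the toy inputs are all assumptions, since the abstract above does not specify them.

```python
import math


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes get area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def binary_cross_entropy_with_logit(logit, target):
    """Numerically stable BCE for a single logit and a 0/1 target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def image_label_loss(proposals, logits, image_labels):
    """Classification loss from image-level labels only.

    proposals    : list of (x1, y1, x2, y2) boxes
    logits       : logits[i][c] is the score of class c for proposal i
    image_labels : set of class indices present in the image

    Assumption: the labels are applied to the largest proposal, a fixed
    rule that does not depend on the model's predictions.
    """
    idx = max(range(len(proposals)), key=lambda i: box_area(proposals[i]))
    loss = 0.0
    for c, logit in enumerate(logits[idx]):
        target = 1.0 if c in image_labels else 0.0
        loss += binary_cross_entropy_with_logit(logit, target)
    return idx, loss


# Toy example: two proposals, two classes, image labelled with class 1.
proposals = [(0, 0, 10, 10), (0, 0, 100, 80)]
logits = [[5.0, -5.0], [0.0, 0.0]]
chosen, loss = image_label_loss(proposals, logits, image_labels={1})
```

Because the proposal is chosen by geometry alone, the loss needs no matching step between labels and predicted boxes, which is the property the abstract emphasizes.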
Full Text Available
Simple Multi-dataset Detection

https://doi.org/10.1109/cvpr52688.2022.00742

Zhou, Xingyi; Koltun, Vladlen; Krähenbühl, Philipp (June 2022, CVPR)

Full Text Available
Global Tracking Transformers

https://doi.org/10.1109/cvpr52688.2022.00857

Zhou, Xingyi; Yin, Tianwei; Koltun, Vladlen; Krähenbühl, Philipp (June 2022, CVPR)

Full Text Available
Neural Point Process for Learning Spatiotemporal Event Dynamics

Zhou, Zihao; Yang, Xingyi (January 2022, Annual Conference on Learning for Dynamics and Control)

Full Text Available
Multimodal Virtual Point 3D Detection

Yin, Tianwei; Zhou, Xingyi; Krähenbühl, Philipp (December 2021, Advances in neural information processing systems)

Full Text Available
Center-Based 3D Object Detection and Tracking

Yin, Tianwei; Zhou, Xingyi; Krähenbühl, Philipp (June 2021, IEEE Conference on Computer Vision and Pattern Recognition)
Three-dimensional objects are commonly represented as 3D boxes in a point-cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. Our framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. In CenterPoint, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. CenterPoint achieved state-of-the-art performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open Dataset, CenterPoint outperforms all previous single model methods by a large margin and ranks first among all Lidar-only submissions.
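The greedy closest-point matching mentioned above can be illustrated with a short Python sketch. This is one possible variant under stated assumptions, not the CenterPoint implementation: projecting each detection back by its regressed velocity, the distance gate `max_dist`, the global closest-pair-first ordering, and all names are hypothetical choices made for the example.

```python
import math


def greedy_match(track_centers, det_centers, det_velocities, dt, max_dist):
    """Greedily pair detections with existing tracks by center distance.

    Returns a dict mapping detection index -> track index. Detections left
    unmatched would start new tracks in a full tracker.
    """
    # Assumption: move each detection back to its estimated position one
    # frame earlier, using its regressed velocity.
    projected = [
        (x - vx * dt, y - vy * dt)
        for (x, y), (vx, vy) in zip(det_centers, det_velocities)
    ]

    # Collect every candidate pair that falls inside the distance gate.
    pairs = []
    for d, (px, py) in enumerate(projected):
        for t, (tx, ty) in enumerate(track_centers):
            dist = math.hypot(px - tx, py - ty)
            if dist <= max_dist:
                pairs.append((dist, d, t))

    # Closest pairs first; each detection and each track is used at most once.
    pairs.sort()
    matches, used_tracks = {}, set()
    for _, d, t in pairs:
        if d in matches or t in used_tracks:
            continue
        matches[d] = t
        used_tracks.add(t)
    return matches


# Toy example in the ground plane: two tracks, three detections.
tracks = [(0.0, 0.0), (10.0, 0.0)]
dets = [(10.5, 0.0), (1.0, 0.0), (50.0, 50.0)]
vels = [(1.0, 0.0), (2.0, 0.0), (0.0, 0.0)]
matches = greedy_match(tracks, dets, vels, dt=0.5, max_dist=2.0)
```

In this toy input, the first two detections land exactly on an existing track after back-projection, and the third is too far from any track to be matched.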
Full Text Available
Unsupervised Domain Adaptation for 3D Keypoint Estimation via View Consistency

Zhou, Xingyi; Karpur, Arjun; Gan, Chuang; Luo, Linjie; Huang, Qixing (September 2018, European Conference on Computer Vision)

In this paper, we introduce a novel unsupervised domain adaptation technique for the task of 3D keypoint prediction from a single depth scan or image. Our key idea is to utilize the fact that predictions from different views of the same or similar objects should be consistent with each other. Such view consistency can provide effective regularization for keypoint prediction on unlabeled instances. In addition, we introduce a geometric alignment term to regularize predictions in the target domain. The resulting loss function can be effectively optimized via alternating minimization. We demonstrate the effectiveness of our approach on real datasets and present experimental results showing that our approach is superior to state-of-the-art general-purpose domain adaptation techniques.
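To make the view-consistency idea concrete, the sketch below scores how well keypoints predicted from several views agree once they are mapped into a shared frame. It is a simplified illustration, not the paper's loss: it assumes the relative rotation of each view is known and is a rotation about the z axis, and it omits the geometric alignment term and the alternating minimization described above. All names are hypothetical.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def view_consistency_loss(view_predictions, view_angles):
    """Mean squared deviation of aligned keypoints from their cross-view mean.

    view_predictions[v][k] : keypoint k predicted in the frame of view v
    view_angles[v]         : assumed-known rotation mapping view v's frame
                             into the shared frame
    """
    aligned = [
        [rotate_z(p, angle) for p in preds]
        for preds, angle in zip(view_predictions, view_angles)
    ]
    n_views, n_keypoints = len(aligned), len(aligned[0])
    total = 0.0
    for k in range(n_keypoints):
        # Consensus location of keypoint k across all views.
        mean = [
            sum(aligned[v][k][i] for v in range(n_views)) / n_views
            for i in range(3)
        ]
        for v in range(n_views):
            total += sum((aligned[v][k][i] - mean[i]) ** 2 for i in range(3))
    return total / (n_views * n_keypoints)


angles = [0.0, math.pi / 2]
# Consistent: both views describe the same point (1, 0, 0) in the shared frame.
consistent = view_consistency_loss([[(1, 0, 0)], [(0, -1, 0)]], angles)
# Inconsistent: the second view's prediction maps to (0, 1, 0) instead.
inconsistent = view_consistency_loss([[(1, 0, 0)], [(1, 0, 0)]], angles)
```

A loss of this kind needs no keypoint labels, which is why it can act as a regularizer on unlabeled target-domain instances.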
Full Text Available
